Exploring mixture estimators in stratified random sampling

Advancements in sensor technology have brought a revolution in data generation. As a result, the study variable is often recorded together with several linearly related auxiliary variables, because doing so is cost-effective and easy. These auxiliary variables are commonly observed as quantitative variables and qualitative variables (attributes), and they can be used jointly to estimate the population mean of the study variable through a mixture estimator. This work proposes a family of generalized mixture estimators under stratified sampling to increase efficiency under symmetrical and asymmetrical distributions, and studies the estimator’s behavior for different sample sizes with respect to its convergence to the Normal distribution. The proposed estimator is found to estimate the population mean of the study variable with more precision than the competing estimators under Normal, Uniform, Weibull, and Gamma distributions. The proposed estimator follows the Cauchy distribution when the sample size is less than 35; otherwise, it converges to normality. Furthermore, an application to two real-life datasets from the health and finance sectors is presented to support the proposed estimator’s significance.


Introduction
Sampling is a procedure of selecting a representative fraction of a population so that one may observe and estimate something about the characteristics of interest for the entire population [1,2]. A primary objective of survey sampling is to achieve a practical design for the survey to attain an adequate sample size for estimating the parameters of interest for the population under study. Survey sampling has several advantages over a full population study, including lower resource consumption, shorter turnaround times, and lower costs [3]. Moreover, it also provides a basis for acquiring precise and useful parameter estimates.
Survey statisticians can improve an estimator's efficiency by enhancing the sampling technique, increasing the sample size, or employing auxiliary data. Auxiliary information about the population can consist of a known variable to which the study variable is approximately related. Typically, this auxiliary information is easy to quantify, whereas measuring the study variable itself can be costly [4]. By using this additional data, which includes characteristics and variables that are highly correlated with the variable of interest, the estimation process can improve the accuracy of the estimated mean of the study variable [5,6]. The i-th value of the study variable Y may be correlated with the corresponding values of one or more auxiliary variables ($X_i$). For example, if the study variable is the number of animals in a cattle field, the average elevation, area, and type of vegetation of the field may serve as auxiliary variables. Auxiliary data is used in survey sampling for three main purposes: pre-selection, selection (i.e., selecting units that correspond to the study variable based on the auxiliary variable), and estimation (i.e., creating estimators of the ratio, product, and regression type).
In many studies, survey statisticians have been keenly interested in estimating parameters for a heterogeneous population. For such populations, Neyman [7] introduced Stratified Random Sampling (StRS), in which the population is partitioned into groups called "strata." A sample is then chosen by some pattern within each stratum, and independent selections are made in different groups. Stratification is a probability sampling design used to increase the precision of estimation [8]. The StRS methodology ensures that every demographic group is adequately represented in the sample. For small sample sizes, adequate precision and accuracy may be achieved by using the StRS procedure. In sample selection, the StRS design helps to minimize bias. From an organizational point of view, stratified sampling is very convenient. Moreover, it is also recommended that auxiliary information be used in the StRS population parameter estimation process, which is useful for comparing estimates among several population groups. For more recent studies on StRS, see studies [9][10][11] and references therein.
Historically, numerous researchers have independently utilized auxiliary variables and attributes, proposing diverse estimators within the StRS framework to enhance efficiency [12]. Cochran [13] developed classical ratio and regression methods to estimate the study variable's population mean. Graunt [14] was the first known user of the ratio estimator, and Cochran was the first to use auxiliary information in ratio estimation. To estimate the population mean, Robson [15] suggested product-type estimators that incorporate the ancillary data. Kadilar and Cingi [16] proposed the estimator when the population coefficient of skewness and kurtosis are unknown in stratified random sampling. Samiuddin and Hanif [17] suggested three cases of using auxiliary information in the estimation of the population mean: no information, full information, and partial information. Ahmad, Hanif [18] suggested a modified and efficient estimator of the population mean using two auxiliary variables in survey sampling. Ahmad, Hanif [19] established a generalized multi-phase multivariate regression estimator with the help of several auxiliary variables. To estimate the population mean, Moeen, Shahbaz [20] suggested mixture estimators that simultaneously utilize auxiliary variables and attributes. Double-phase and multi-phase sampling of a vector of variables of interest are used to calculate the population mean. Malik and Singh [21] also worked on the stratified sampling estimator. In single-phase sampling, Verma, Sharma [22] suggested some modified regression-cum-ratio and exponential ratio-type estimators. Singh, Ragen [24] presented an improved version of the exponential estimator of the mean under StRS, initially proposed by Zaman [23], together with its properties under a large-sample approximation. Singh, Ragen [24] also suggested a class of estimators of the population variance along with their asymptotic properties. Zaman [25] established a class of ratio-type estimators that use auxiliary attributes to estimate the population mean. Under the StRS design, mixture regression-cum-ratio estimators in a single-phase scheme were established by Moeen [26]. Yadav and Zaman [27] suggested a class of estimators using both conventional and non-conventional auxiliary variable parameters. Under the StRS design, using an auxiliary attribute, Zaman and Kadilar [28] suggested exponential ratio estimators for stratified two-phase sampling. Lawson and Thongsak [29] introduced a novel set of population mean estimators designed for use in stratified random sampling. Their study examines the bias and mean square error of these estimators using a Taylor series approximation and evaluates the estimators' performance through simulation and an application to air pollution data in northern Thailand. Results from the air pollution data show that their proposed estimators outperform others in terms of efficiency.
Many sampling survey investigations seek to devise an efficient estimator for the population mean. Numerous studies have pursued this aim, incorporating various adaptations of the classical ratio, product, and regression estimators under SRS and StRS. Kadilar and Cingi [16] and Zaman [25] employed auxiliary variables and attributes, respectively, proposing different estimators within the StRS framework. However, these estimators are beneficial only in specific scenarios, namely when either auxiliary variables or auxiliary attributes alone are used to estimate the population mean of the study variable. Consequently, a gap exists in the literature concerning the simultaneous utilization of auxiliary attributes and auxiliary variables alongside the study variable to estimate the population parameter. For example, in a household survey, income, expenditures, family size, number of employed, and number of literate persons are related variables that can be used as auxiliary information for estimating any characteristic. In this example, we can use income (a quantitative variable) and family size (a qualitative variable) simultaneously to estimate the expenditure (a study variable). Therefore, the current article aims to propose a class of generalized mixture ratio estimators to estimate the population mean of the study variable by simultaneously incorporating the auxiliary attributes (qualitative) and variables (quantitative) in stratified random sampling (StRS). The suggested estimators could thus be used in various sampling surveys.
The subsequent sections of the paper are organized as follows: Section "Notations under Stratified Random Sampling" offers an overview of stratified random sampling. Section "A Family of Proposed Estimators under Stratified Random Sampling" introduces the proposed estimators. In Section "A Simulation Study", a comparative analysis is conducted based on the simulation study. Section "Illustrative Examples" showcases two real-life examples. Finally, Section "Conclusion and Recommendations" outlines the conclusions drawn from the study and provides recommendations for future research.

Notations under stratified random sampling
Let N denote the size of the population. Suppose that an attribute j = 1, 2, 3, ..., m is dichotomized in the population based on its presence or absence, so that the attribute takes the value "1" if the h-th unit of the population possesses attribute j and "0" otherwise.
Under stratified random sampling, we take into account the following notations in order to determine the bias and mean square error (MSE) of the suggested estimator:
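
For orientation, the usual textbook quantities of stratified random sampling can be written as below. This is a sketch of the standard definitions only, under the assumption of k strata and simple random sampling without replacement within each stratum; the symbols $W_h$, $f_h$, and $S_{yh}^2$ are introduced here for illustration and are not necessarily the exact list used by the authors.

```latex
\begin{align*}
W_h &= \frac{N_h}{N}, \qquad f_h = \frac{n_h}{N_h}, \qquad N = \sum_{h=1}^{k} N_h, \qquad n = \sum_{h=1}^{k} n_h, \\
\bar{y}_{st} &= \sum_{h=1}^{k} W_h \, \bar{y}_h, \\
\operatorname{Var}(\bar{y}_{st}) &= \sum_{h=1}^{k} W_h^{2} \, \frac{1 - f_h}{n_h} \, S_{yh}^{2}, \\
C_{yh} &= \frac{S_{yh}}{\bar{Y}_h}, \qquad C_{xh} = \frac{S_{xh}}{\bar{X}_h}.
\end{align*}
```

Here $\bar{y}_h$ is the sample mean of Y in stratum h, and $\operatorname{Var}(\bar{y}_{st})$ is the variance of the unbiased stratified mean, which commonly serves as the baseline against which ratio-type and mixture estimators are compared.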

Relevant estimators under StRS
This section discusses some existing estimators and their bias and mean square error (MSE). Kadilar and Cingi [16] proposed the following estimators:
In stratified random sampling, using the auxiliary attribute, Zaman [25] proposed the following estimators:

A family of proposed estimators under stratified random sampling
Within this section, we introduce a novel class of generalized mixture estimators, building upon the framework proposed by Zaman [25]; this estimator is suitable for estimating the population mean $\bar{Y}$, incorporating the concurrent utilization of auxiliary variables and attributes within a stratified random sampling context. Here a and K are constants, with a taking on the values 0 and 1 and $K \in \mathbb{R}$. Consequently, $K_1$, $K_2$, $K_3$, and $K_4$ may be any real numbers. Eq (19) can be rewritten as follows using Eq (20):
However, for the h-th stratum, let $T_{KM_h}$ be a mixture estimator of the population mean given as follows:
Similarly, we can rewrite Eq (23) as follows by using $\bar{y}_h$ as a common factor:
Using notations from the preceding section, the bias of $T_{KM_h}$ is derived as follows, and after simplification, we obtain:
Similarly, using notations from the preceding section and after simplification, we get:
Hence, the bias expression of $T_{KM_{st}}$ is obtained as follows:
Moreover, the following simplification can be used to determine the maximum value of "a":
Adding the value of a in Eq (27) and after simplification, we obtain:
Hence, the mean square error (MSE) expression of $T_{KM_{st}}$ is obtained as follows:
The complete derivations of $\mathrm{Bias}(T_{KM_h})$ and $\mathrm{MSE}(T_{KM_h})$ given in Eqs (25) to (29) are provided in S1 Appendix.

Unique cases of the proposed mixture estimator
In this section, we examine specific cases of the proposed mixture estimator obtained from various combinations of constants. While Tables 1 and 2 highlight certain special cases of the estimator, exploring additional combinations of constants can reveal further special cases not covered in the tables.

A simulation study
We conducted a comprehensive simulation study to assess the proposed estimator's effectiveness. This involved comparing the performance of our estimator with that of several alternative estimators under StRS conditions. The percent relative efficiency (PRE) served as the criterion for evaluating estimator performance. We followed the steps outlined below to compute the PREs for our proposed estimator under StRS.
1. A simulated population comprising 1500 observations is established for the study variable Y. X serves as an auxiliary variable, while P represents an auxiliary attribute. The variables are generated from the Normal and Uniform (symmetric) and the Gamma and Weibull (asymmetric) distributions, and 10,000 samples of each specified size are drawn. Table 3 outlines how the Bernoulli distribution, with specific parameter values, is used to generate the auxiliary attribute. The population size is $N = N_1 + N_2 = 800 + 700 = 1500$.
2. The given proportional allocation formula is used to determine the sample size from the stratum.
Using StRS, the population was divided into two strata of sizes 800 and 700. For each total sample size, 10,000 random samples are drawn: samples of size 20 take 12 units from stratum one and 8 units from stratum two; samples of size 50 take 30 and 20 units; samples of size 80 take 50 and 30 units; samples of size 200 take 120 and 80 units; and samples of size 400 take 280 and 120 units, respectively. The results are shown in Table 4.
3. The PREs of estimators are calculated by using the following expression: where "i" stands for the estimator whose performance is to be compared. To check the efficiency of the proposed estimator, the MSE expression given in Eq (30) has been utilized. Tables 5 and 6 present the PREs of the proposed estimator in comparison to those of competing estimators. We also assessed the impact of sample size on MSEs and employed sample sizes of 20, 50, 80, 200, and 400. The proposed estimator attained higher PREs when compared to the other estimators under consideration. Additionally, the percent relative efficiency (PRE) values increased as sample sizes increased, as illustrated in Fig 1.
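
The simulation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the closed form of the proposed mixture estimator is not reproduced in this text, so the classical separate ratio estimator is used as a stand-in competitor, the PRE is taken as 100 times the ratio of the empirical MSE of the stratified mean to that of the competitor, and the distribution parameters, the linear model for Y, and the 2,000 replications (instead of 10,000) are hypothetical choices made only to keep the example small.

```python
import random
import statistics


def make_stratum(rng, size, mean_x, sd_x, slope, noise_sd, p_attr):
    """Generate (y, x, p) units: x ~ Normal, p ~ Bernoulli, y linear in x and p plus noise.
    All parameter values are hypothetical, chosen for illustration only."""
    units = []
    for _ in range(size):
        x = rng.gauss(mean_x, sd_x)
        p = 1 if rng.random() < p_attr else 0
        y = slope * x + 2.0 * p + rng.gauss(0.0, noise_sd)
        units.append((y, x, p))
    return units


def stratified_mean(samples, weights):
    """Usual stratified estimator: sum over strata of W_h * ybar_h."""
    return sum(w * statistics.fmean(u[0] for u in s) for s, w in zip(samples, weights))


def separate_ratio(samples, weights, pop_x_means):
    """Classical separate ratio estimator (stand-in for the paper's estimator)."""
    total = 0.0
    for s, w, big_x_bar in zip(samples, weights, pop_x_means):
        y_bar = statistics.fmean(u[0] for u in s)
        x_bar = statistics.fmean(u[1] for u in s)
        total += w * y_bar * big_x_bar / x_bar
    return total


def simulate_pre(strata, n_h, reps=2000, seed=1):
    """Empirical PRE of the ratio estimator relative to the stratified mean."""
    rng = random.Random(seed)
    n_pop = sum(len(s) for s in strata)
    weights = [len(s) / n_pop for s in strata]
    true_mean = statistics.fmean(u[0] for s in strata for u in s)
    pop_x_means = [statistics.fmean(u[1] for u in s) for s in strata]
    se_mean = se_ratio = 0.0
    for _ in range(reps):
        # Simple random sampling without replacement, independently in each stratum.
        samples = [rng.sample(s, k) for s, k in zip(strata, n_h)]
        se_mean += (stratified_mean(samples, weights) - true_mean) ** 2
        se_ratio += (separate_ratio(samples, weights, pop_x_means) - true_mean) ** 2
    return 100.0 * (se_mean / reps) / (se_ratio / reps)


pop_rng = random.Random(0)
strata = [
    make_stratum(pop_rng, 800, 50.0, 5.0, 2.0, 3.0, 0.4),  # stratum one, N_1 = 800
    make_stratum(pop_rng, 700, 60.0, 5.0, 2.0, 3.0, 0.6),  # stratum two, N_2 = 700
]
pre = simulate_pre(strata, n_h=(12, 8))  # n = 20 split as in the text
print(f"PRE of the separate ratio estimator vs. the stratified mean: {pre:.1f}")
```

A PRE above 100 indicates that the competitor has a smaller empirical MSE than the stratified mean. Repeating the call with `n_h=(30, 20)`, `(50, 30)`, `(120, 80)`, and `(280, 120)` mirrors the remaining sample sizes listed above.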

Exploring the Best-Fitted distribution
This section explores the most appropriate probability distribution for the proposed generalized mixture estimator across various sample sizes using EasyFit [30]. EasyFit employs two methods, namely the Kolmogorov-Smirnov and Anderson-Darling tests, for this purpose. The auxiliary variables are generated from a normal distribution, while the auxiliary attribute is generated from a binomial distribution using the R language (version 4.2.2), with parameters specified in Table 7. We select one thousand samples of sizes n = 20, n = 50, and n = 80 and then construct the sampling distribution of the proposed estimator. As depicted in Fig 2, the proposed generalized mixture estimator conforms to the Normal distribution, with the scale parameter and location parameter computed as 3.13 and 5.30, respectively.

Exploring the distribution of the proposed estimator for different sample sizes
This section examines how the suggested generalized mixture estimator's distribution behaves for different sample sizes.
Table 8 displays the probability distribution of the estimator for n = 20, 50, and 80. When the initial data are drawn from the Normal distribution, the proposed estimator follows the Cauchy distribution for a sample size of n = 20; the cutoff point is n = 35, from which the estimator's probability distribution converges to the Normal. Hence, the proposed estimator follows the Normal distribution for n = 50 and 80. Table 9 illustrates the probability distributions of the proposed estimator against each value of n. The Kolmogorov-Smirnov results indicate that the p-values exceed 5%, so the null hypothesis ($H_0$) that the data adhere to the specified distribution is not rejected.
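
The kind of goodness-of-fit check described above can be sketched with the Python standard library. This is an illustration under assumptions, not a reproduction of the EasyFit analysis: the sample mean of Normal data stands in for the proposed estimator, the Cauchy parameters are fitted by the median and half the interquartile range, and the p-value uses the asymptotic Kolmogorov series, which is only approximate when the parameters are estimated from the same data.

```python
import math
import random
import statistics


def ks_statistic(data, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 n d^2)."""
    if d <= 0.0:
        return 1.0
    lam = math.sqrt(n) * d
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))


def cauchy_cdf(loc, scale):
    """CDF of a Cauchy distribution with the given location and scale."""
    return lambda x: 0.5 + math.atan((x - loc) / scale) / math.pi


rng = random.Random(42)
n, reps = 20, 1000  # sample size and number of replicated samples
# Stand-in estimator: the sample mean of n Normal(5, 3) draws (hypothetical parameters).
estimates = [statistics.fmean(rng.gauss(5.0, 3.0) for _ in range(n)) for _ in range(reps)]

mu, sigma = statistics.fmean(estimates), statistics.stdev(estimates)
q1, med, q3 = statistics.quantiles(estimates, n=4)
fits = {
    "Normal": statistics.NormalDist(mu, sigma).cdf,
    "Cauchy": cauchy_cdf(med, (q3 - q1) / 2.0),
}

results = {}
for name, cdf in fits.items():
    d = ks_statistic(estimates, cdf)
    results[name] = (d, ks_pvalue(d, reps))
    print(f"{name}: D = {results[name][0]:.4f}, p = {results[name][1]:.4f}")
```

A p-value above 0.05 means the fitted distribution is not rejected at the 5% level. Replacing the stand-in with replicated values of an actual estimator at n = 20, 50, and 80 gives the comparison across sample sizes discussed in this section.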

Illustrative examples
In this section, we present two real-life examples to illustrate the practical application of the proposed estimator. The distributions of the study and auxiliary variables in each of the two data sets are identified using EasyFit version 5.5 professional. The details of each data set are given below.

Data-I: Tumor data
Data-I has been taken from Andersen, Borgan [31]. We are interested in estimating the average thickness of the tumor by including auxiliary information. The data consist of 205 entities. StRS has been used with the following variables and attribute: Y, the thickness of the tumor (mm), is the study variable; X, the age of the patient (at operation time), is the auxiliary variable; and P, gender (0 = Male, 1 = Female), is the auxiliary attribute. The variable 'whether a patient was ulcerated or not' has been used for stratification purposes. Patients who are not ulcerated are placed in stratum 1, while stratum 2 consists of the remaining patients. Table 10 presents the parameters of the data. The population of size 205 is split into two strata, with sizes 115 and 90, respectively, as Table 10 illustrates. The proportional allocation scheme has been utilized to select random samples of sizes 12, 30, and 50 from stratum 1 and 8, 20, and 30 from stratum 2, and PREs are computed. In stratum 1, the average thickness of the tumor is 1.8113, while in stratum 2, its value is 4.3433. The variations in the tumor thickness in strata 1 and 2 are 4.7364 and 10.4244, respectively. The average age is 52.463, and the average of the attribute gender is 0.385. The variation in age is 277.95, and in gender, it is 0.238.
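
As a small worked illustration of proportional allocation, the textbook rule $n_h = n N_h / N$ can be computed as below. The rounding step is an assumption: a largest-remainder rule is used here so that the stratum sample sizes add up to n, and the function name `proportional_allocation` is hypothetical. The splits reported in the text (for example 12 and 8 for n = 20) need not coincide with this particular rounding rule.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n across strata in proportion to N_h.

    Fractional allocations are floored, and the leftover units are given to
    the strata with the largest fractional remainders (an assumed convention).
    """
    n_pop = sum(stratum_sizes)
    raw = [n * size / n_pop for size in stratum_sizes]
    alloc = [int(r) for r in raw]
    leftover = n - sum(alloc)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


for total in (20, 50, 80):
    # Stratum sizes 115 and 90 are those reported for Data-I.
    print(total, proportional_allocation([115, 90], total))
```

The same function applies to the simulated population by passing `[800, 700]` as the stratum sizes.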
For Data-II, Table 11 shows that the original data set contained 28155 observations. The observations are divided into two strata of sizes 25631 and 2524, respectively. The proportional allocation scheme has been utilized to select random samples of sizes 12, 30, and 50 from stratum 1 and 8, 20, and 30 from stratum 2, and PREs are computed. In stratum 1, the average wage is 640.2, while in stratum 2, the average wage is 233.73. In stratum 1, the variation in wage is 197379.9, and in stratum 2, its value is 139839.9. The average number of years of education is 13.067, and the average SMSA is 0.7435. Finally, the variation in the number of years of education is 8.408, and in SMSA, the variation is 0.1907.
Tables 12 and 13 show the necessary calculations (estimated means and PREs) for the proposed and comparative estimators under StRS for the two real-life datasets. We used sample sizes of 20, 50, and 80 for each data set. In dataset I, $T_{KM_{st}}$ has the highest PRE for n = 20, at 105.42. Likewise, for n = 50 and 80, $T_{KM_{st}}$ attains the highest PRE values of 105.98 and 105.99, respectively. Similarly, the suggested generalized mixture estimator exhibits the highest PREs in dataset II. This implies that the suggested estimator is more efficient than the comparative estimators. Furthermore, the PREs rise with larger sample sizes. In Fig 3, the suggested estimator's performance is further illustrated graphically.

Conclusion and recommendations
In different scenarios, using auxiliary information is a very useful strategy to improve the estimator's efficiency. For this purpose, we have used auxiliary variables and attributes simultaneously. In this study, we introduce a generalized family of mixture estimators under Stratified Random Sampling (StRS), inspired by the work of Zaman [25], aimed at enhancing efficiency. We derive the Mean Squared Error (MSE) expressions for the proposed estimator, supported by simulation results. Notably, our proposed class of estimators outperforms competing estimators in terms of Percent Relative Efficiency (PRE). Through simulation studies and real-life applications in the health and finance sectors, we demonstrate that our proposed estimator consistently delivers superior results compared to competitors across Normal, Uniform, Weibull, and Gamma distributions. Ultimately, we conclude that the efficiency of the suggested estimator holds both theoretically and in practical settings. Moreover, we suggest extending the scope of this study to include other types of estimators, such as ratio, product, power, difference, exponential, and regression estimators, under stratified random sampling.

Fig 2. Probability distribution of generalized proposed mixture estimator. https://doi.org/10.1371/journal.pone.0307607.g002

The proposed family is aimed at enhancing efficiency across symmetric and asymmetric distributions for estimating finite population means. Additionally, we analyze the estimator's behavior across various sample sizes regarding its convergence to the Normal distribution. Our findings indicate that the proposed mixture estimator adheres to the Normal distribution for sample sizes greater than or equal to 35.

Fig 3. A comparison between the proposed estimator and competitive estimators for real-life datasets. (A) Data-I and (B) Data-II. https://doi.org/10.1371/journal.pone.0307607.g003

Here the number of units in stratum h is represented by $N_h$, and $n = n_1 + n_2 + n_3 + \dots + n_h + \dots + n_k$, where the number of sampling units in stratum h is represented by $n_h$. Y stands for the study variable, X refers to the auxiliary variable, and P is the population proportion of the auxiliary attribute. $S_{yh}^2$ and $S_{xh}^2$ are the variances in stratum h, $C_{yh}$ and $C_{xh}$ are the coefficients of variation in stratum h, $r_{xyh}$ is the correlation coefficient between X and Y in stratum h, $\beta_{2(xh)}$ and $\beta_{1(xh)}$ are the kurtosis and skewness in stratum h, and $C_{xyh} = r_{xyh} C_{xh} C_{yh}$.

Under the Normal distribution, for n = 20, 50, 80, 200, and 400, the highest PREs of the proposed generalized mixture estimator $T_{KM_{st}}$ are 102.98, 103.35, 104.46, 106.21, and 107.31, respectively. Similarly, under the Uniform distribution, for n = 20, 50, 80, 200, and 400, the highest PREs of the proposed generalized mixture estimator $T_{KM_{st}}$ were reported as 102.92, 103.56, 103.91, 105.56, and 106.23, respectively. Moreover, similar results are evident in Table 6: under the Gamma and Weibull distributions, the proposed generalized mixture estimator $T_{KM_{st}}$ demonstrates significantly superior PRE values.

Table 3. Simulated auxiliary variables and attributes. https://doi.org/10.1371/journal.pone.0307607.t003

Table 12. Estimated sample means and percent relative efficiencies (PREs) of the proposed generalized mixture estimators and the comparative estimators for data set I.
https://doi.org/10.1371/journal.pone.0307607.t012